header%20ipynb.png

Daftar Isi:

  1. Memuat Data
  2. Menelaah Data
Catatan : Jika belum pernah instal library, gunakan perintah berikut secara inline atau melalui terminal.
  1. !pip install pandas
  2. !pip install numpy

Memuat Data


In [1]:
# import library
import pandas as pd
import numpy as np
In [2]:
# load dataset
path = "epl-goalscorer(20-21).csv"
df = pd.read_csv(path)
In [3]:
# menampilkan 5 baris awal dari dataset
df.head()
Out[3]:
Unnamed: 0 id player_name games time goals xG assists xA shots key_passes yellow_cards red_cards position team_title npg npxG xGChain xGBuildup
0 0 647 Harry Kane 35 3097 23 22.174859 14 7.577094 138 49 1 0 F Tottenham 19 19.130183 24.995648 4.451257
1 1 1250 Mohamed Salah 37 3085 22 20.250847 5 6.528526 126 55 0 0 F M S Liverpool 16 15.683834 28.968234 9.800236
2 2 1228 Bruno Fernandes 37 3117 18 16.019454 12 11.474996 121 95 6 0 M S Manchester United 9 8.407840 26.911412 11.932285
3 3 453 Son Heung-Min 37 3139 17 11.023287 10 9.512992 68 75 0 0 F M S Tottenham 16 10.262118 20.671916 6.608751
4 4 822 Patrick Bamford 38 3085 17 18.401863 7 3.782247 107 30 3 0 F S Leeds 15 16.879525 23.394953 4.131796
In [4]:
# menampilkan 5 baris akhir dari dataset
df.tail()
Out[4]:
Unnamed: 0 id player_name games time goals xG assists xA shots key_passes yellow_cards red_cards position team_title npg npxG xGChain xGBuildup
517 517 9415 Jaden Philogene-Bidace 1 1 0 0.000000 0 0.0 0 0 0 0 S Aston Villa 0 0.000000 0.056044 0.056044
518 518 9423 Gaetano Berardi 2 113 0 0.074761 0 0.0 1 0 0 0 D S Leeds 0 0.074761 0.231278 0.231278
519 519 9524 Anthony Elanga 1 67 0 0.000000 0 0.0 0 0 0 0 M Manchester United 0 0.000000 0.000000 0.000000
520 520 9540 Femi Seriki 1 1 0 0.000000 0 0.0 0 0 0 0 S Sheffield United 0 0.000000 0.000000 0.000000
521 521 9552 Tyrese Francois 1 13 0 0.000000 0 0.0 0 0 0 0 S Fulham 0 0.000000 0.000000 0.000000

Menelaah Data


In [5]:
# mengungkap tipe-tipe data dari setiap kolom
print(df.dtypes)
Unnamed: 0        int64
id                int64
player_name      object
games             int64
time              int64
goals             int64
xG              float64
assists           int64
xA              float64
shots             int64
key_passes        int64
yellow_cards      int64
red_cards         int64
position         object
team_title       object
npg               int64
npxG            float64
xGChain         float64
xGBuildup       float64
dtype: object

Dari hasil output diatas terlihat bahwa 2 kolom pertama (yaitu: 'Unnamed: 0', dan 'id') hanyalah data numerik yang biasanya tidak memiliki makna rill, sehingga kita bisa tidak gunakan kolom/field tsb.

In [6]:
df_noid = df.iloc[:,2:]
df_noid
Out[6]:
player_name games time goals xG assists xA shots key_passes yellow_cards red_cards position team_title npg npxG xGChain xGBuildup
0 Harry Kane 35 3097 23 22.174859 14 7.577094 138 49 1 0 F Tottenham 19 19.130183 24.995648 4.451257
1 Mohamed Salah 37 3085 22 20.250847 5 6.528526 126 55 0 0 F M S Liverpool 16 15.683834 28.968234 9.800236
2 Bruno Fernandes 37 3117 18 16.019454 12 11.474996 121 95 6 0 M S Manchester United 9 8.407840 26.911412 11.932285
3 Son Heung-Min 37 3139 17 11.023287 10 9.512992 68 75 0 0 F M S Tottenham 16 10.262118 20.671916 6.608751
4 Patrick Bamford 38 3085 17 18.401863 7 3.782247 107 30 3 0 F S Leeds 15 16.879525 23.394953 4.131796
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
517 Jaden Philogene-Bidace 1 1 0 0.000000 0 0.000000 0 0 0 0 S Aston Villa 0 0.000000 0.056044 0.056044
518 Gaetano Berardi 2 113 0 0.074761 0 0.000000 1 0 0 0 D S Leeds 0 0.074761 0.231278 0.231278
519 Anthony Elanga 1 67 0 0.000000 0 0.000000 0 0 0 0 M Manchester United 0 0.000000 0.000000 0.000000
520 Femi Seriki 1 1 0 0.000000 0 0.000000 0 0 0 0 S Sheffield United 0 0.000000 0.000000 0.000000
521 Tyrese Francois 1 13 0 0.000000 0 0.000000 0 0 0 0 S Fulham 0 0.000000 0.000000 0.000000

522 rows × 17 columns

In [7]:
# menampilkan statistik dasar setiap kolom data yang bertipe numerik.
df_noid.describe()
Out[7]:
games time goals xG assists xA shots key_passes yellow_cards red_cards npg npxG xGChain xGBuildup
count 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000
mean 19.643678 1420.068966 1.862069 2.000806 1.289272 1.376029 17.379310 12.963602 2.061303 0.091954 1.668582 1.821450 5.663368 3.455060
std 11.619836 1031.604819 3.338851 3.317946 2.083350 1.886510 21.572664 16.164361 2.203661 0.295800 2.909929 2.931176 5.600249 3.376584
min 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 10.000000 470.250000 0.000000 0.074668 0.000000 0.049245 2.000000 1.000000 0.000000 0.000000 0.000000 0.074668 1.191391 0.720353
50% 21.000000 1342.000000 1.000000 0.737295 0.000000 0.691122 10.000000 7.000000 2.000000 0.000000 0.500000 0.715585 4.252738 2.656397
75% 30.000000 2319.000000 2.000000 2.053378 2.000000 2.050509 23.750000 19.000000 3.000000 0.000000 2.000000 1.945799 8.308002 5.254647
max 38.000000 3420.000000 23.000000 22.174859 14.000000 11.474996 138.000000 95.000000 12.000000 2.000000 19.000000 19.130183 28.968234 18.323006
In [8]:
# jika ingin menampilkan juga statistik kolom- 
# yang bertipe non-numerik, mencakup beberapa banyak nilai unik- 
# dalam kolom (unique), nilai modus(top), serta frekuensi modus(freq)
df_noid.describe(include="all")
Out[8]:
player_name games time goals xG assists xA shots key_passes yellow_cards red_cards position team_title npg npxG xGChain xGBuildup
count 522 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522.000000 522 522 522.000000 522.000000 522.000000 522.000000
unique 522 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14 28 NaN NaN NaN NaN
top Harry Kane NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN M S Everton NaN NaN NaN NaN
freq 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 106 28 NaN NaN NaN NaN
mean NaN 19.643678 1420.068966 1.862069 2.000806 1.289272 1.376029 17.379310 12.963602 2.061303 0.091954 NaN NaN 1.668582 1.821450 5.663368 3.455060
std NaN 11.619836 1031.604819 3.338851 3.317946 2.083350 1.886510 21.572664 16.164361 2.203661 0.295800 NaN NaN 2.909929 2.931176 5.600249 3.376584
min NaN 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN 0.000000 0.000000 0.000000 0.000000
25% NaN 10.000000 470.250000 0.000000 0.074668 0.000000 0.049245 2.000000 1.000000 0.000000 0.000000 NaN NaN 0.000000 0.074668 1.191391 0.720353
50% NaN 21.000000 1342.000000 1.000000 0.737295 0.000000 0.691122 10.000000 7.000000 2.000000 0.000000 NaN NaN 0.500000 0.715585 4.252738 2.656397
75% NaN 30.000000 2319.000000 2.000000 2.053378 2.000000 2.050509 23.750000 19.000000 3.000000 0.000000 NaN NaN 2.000000 1.945799 8.308002 5.254647
max NaN 38.000000 3420.000000 23.000000 22.174859 14.000000 11.474996 138.000000 95.000000 12.000000 2.000000 NaN NaN 19.000000 19.130183 28.968234 18.323006

Contoh fungsi statistik setiap kolom (yang applicable)


In [9]:
df_noid.mean()
C:\Users\andii\AppData\Local\Temp/ipykernel_9320/3397690596.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_noid.mean()
Out[9]:
games             19.643678
time            1420.068966
goals              1.862069
xG                 2.000806
assists            1.289272
xA                 1.376029
shots             17.379310
key_passes        12.963602
yellow_cards       2.061303
red_cards          0.091954
npg                1.668582
npxG               1.821450
xGChain            5.663368
xGBuildup          3.455060
dtype: float64
In [10]:
df_noid.sum()
Out[10]:
player_name     Harry KaneMohamed SalahBruno FernandesSon Heun...
games                                                       10254
time                                                       741276
goals                                                         972
xG                                                    1044.420572
assists                                                       673
xA                                                     718.287269
shots                                                        9072
key_passes                                                   6767
yellow_cards                                                 1076
red_cards                                                      48
position        FF M SM SF M SF SF SF SFM SF M SF SF SF SF SF ...
team_title      TottenhamLiverpoolManchester UnitedTottenhamLe...
npg                                                           871
npxG                                                     950.7971
xGChain                                               2956.278233
xGBuildup                                             1803.541131
dtype: object
In [11]:
df_noid.median()
C:\Users\andii\AppData\Local\Temp/ipykernel_9320/1530929563.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_noid.median()
Out[11]:
games             21.000000
time            1342.000000
goals              1.000000
xG                 0.737295
assists            0.000000
xA                 0.691122
shots             10.000000
key_passes         7.000000
yellow_cards       2.000000
red_cards          0.000000
npg                0.500000
npxG               0.715585
xGChain            4.252738
xGBuildup          2.656397
dtype: float64
In [12]:
df_noid.var()
C:\Users\andii\AppData\Local\Temp/ipykernel_9320/3820954888.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_noid.var()
Out[12]:
games           1.350206e+02
time            1.064209e+06
goals           1.114793e+01
xG              1.100877e+01
assists         4.340345e+00
xA              3.558919e+00
shots           4.653798e+02
key_passes      2.612866e+02
yellow_cards    4.856120e+00
red_cards       8.749752e-02
npg             8.467687e+00
npxG            8.591795e+00
xGChain         3.136279e+01
xGBuildup       1.140132e+01
dtype: float64
In [13]:
df_noid.std()
C:\Users\andii\AppData\Local\Temp/ipykernel_9320/3048348673.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_noid.std()
Out[13]:
games             11.619836
time            1031.604819
goals              3.338851
xG                 3.317946
assists            2.083350
xA                 1.886510
shots             21.572664
key_passes        16.164361
yellow_cards       2.203661
red_cards          0.295800
npg                2.909929
npxG               2.931176
xGChain            5.600249
xGBuildup          3.376584
dtype: float64
In [14]:
df_noid.quantile(0.75)
Out[14]:
games             30.000000
time            2319.000000
goals              2.000000
xG                 2.053378
assists            2.000000
xA                 2.050509
shots             23.750000
key_passes        19.000000
yellow_cards       3.000000
red_cards          0.000000
npg                2.000000
npxG               1.945799
xGChain            8.308002
xGBuildup          5.254647
Name: 0.75, dtype: float64

Mencari pencilan dengan Tukey's fences (1)


In [15]:
q1 = df_noid.quantile(0.25)
q3 = df_noid.quantile(0.75)
iqr = q3 - q1
iqr
Out[15]:
games             20.000000
time            1848.750000
goals              2.000000
xG                 1.978711
assists            2.000000
xA                 2.001264
shots             21.750000
key_passes        18.000000
yellow_cards       3.000000
red_cards          0.000000
npg                2.000000
npxG               1.871131
xGChain            7.116612
xGBuildup          4.534294
dtype: float64

Mencari pencilan dengan Tukey's fences (2)


In [16]:
# handle warning 
import warnings 
warnings.filterwarnings('ignore')

# outlier filter
df_noid_align, iqr_new = df_noid.align(iqr, axis=1, copy=False, join='outer')
outlier_filter = (df_noid < q1 - 1.5 * iqr_new) | (df_noid > q3 + 1.5 * iqr_new)
outlier_filter
Out[16]:
assists games goals key_passes npg npxG player_name position red_cards shots team_title time xA xG xGBuildup xGChain yellow_cards
0 True False True True True True False False False True False False True True False True False
1 False False True True True True False False False True False False True True False True False
2 True False True True True True False False False True False False True True False True False
3 True False True True True True False False False True False False True True False True False
4 True False True False True True False False False True False False False True False True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
517 False False False False False False False False False False False False False False False False False
518 False False False False False False False False False False False False False False False False False
519 False False False False False False False False False False False False False False False False False
520 False False False False False False False False False False False False False False False False False
521 False False False False False False False False False False False False False False False False False

522 rows × 17 columns

Mencari pencilan dengan Tukey's fences (3)


In [17]:
df_noid[outlier_filter['assists']] \
    .loc[:, ['player_name', 'assists']] \
    .sort_values(by=['assists'], ascending=False)
Out[17]:
player_name assists
0 Harry Kane 14
2 Bruno Fernandes 12
58 Kevin De Bruyne 11
3 Son Heung-Min 10
51 Jack Grealish 10
6 Jamie Vardy 9
15 Marcus Rashford 9
57 Raphinha 9
41 Jack Harrison 8
281 Aaron Cresswell 8
83 Pascal Groß 8
49 Timo Werner 8
32 James Ward-Prowse 7
26 Roberto Firmino 7
16 Sadio Mané 7
130 Trent Alexander-Arnold 7
204 Andrew Robertson 7
4 Patrick Bamford 7
358 Lucas Digne 7
42 Bertrand Traoré 6
22 Raheem Sterling 6
18 Matheus Pereira 6
62 Pedro Neto 6
79 Eberechi Eze 6
137 Bernardo Silva 6
211 Said Benrahma 6
495 Vladimir Coufal 6

Value_counts()


menghasilkan frekuensi setiap nilai unik di dalam kolom, yang tertinggi count-nya merupakan modus pada kolom tsb. Ada data dengan dua/tiga nama tim karena ada pemain yang bermain di dua/tiga klub dalam musim yang sama(ada transfer pemain)

In [18]:
df_noid['team_title'].value_counts()
Out[18]:
Everton                          28
West Bromwich Albion             28
Sheffield United                 27
Southampton                      27
Manchester United                27
Fulham                           27
Leicester                        27
Wolverhampton Wanderers          27
Liverpool                        27
Brighton                         26
Arsenal                          26
Newcastle United                 26
Burnley                          25
Chelsea                          25
Crystal Palace                   24
Manchester City                  24
Tottenham                        24
West Ham                         23
Aston Villa                      23
Leeds                            23
Arsenal,Newcastle United          1
Everton,Southampton               1
Aston Villa,Chelsea               1
Liverpool,Southampton             1
Chelsea,Fulham                    1
West Bromwich Albion,West Ham     1
Arsenal,West Bromwich Albion      1
Arsenal,Brighton                  1
Name: team_title, dtype: int64

Analisis dengan groupby


method groupby memungkinkan analisa dilakukan secara per kelompok nilai atribut tertentu. misal: rerata dan simpangan baku gol per tim

In [19]:
df.groupby('team_title')['goals'].std()
Out[19]:
team_title
Arsenal                          3.352381
Arsenal,Brighton                      NaN
Arsenal,Newcastle United              NaN
Arsenal,West Bromwich Albion          NaN
Aston Villa                      3.696489
Aston Villa,Chelsea                   NaN
Brighton                         2.158703
Burnley                          2.475210
Chelsea                          2.350177
Chelsea,Fulham                        NaN
Crystal Palace                   2.901461
Everton                          3.467727
Everton,Southampton                   NaN
Fulham                           1.439175
Leeds                            4.153193
Leicester                        4.020602
Liverpool                        4.931439
Liverpool,Southampton                 NaN
Manchester City                  3.867132
Manchester United                4.317855
Newcastle United                 2.483174
Sheffield United                 1.467599
Southampton                      3.141941
Tottenham                        5.855135
West Bromwich Albion             2.310260
West Bromwich Albion,West Ham         NaN
West Ham                         3.369240
Wolverhampton Wanderers          1.648620
Name: goals, dtype: float64
In [20]:
df.groupby('team_title')['goals'].mean()
Out[20]:
team_title
Arsenal                          1.961538
Arsenal,Brighton                 0.000000
Arsenal,Newcastle United         8.000000
Arsenal,West Bromwich Albion     0.000000
Aston Villa                      2.130435
Aston Villa,Chelsea              3.000000
Brighton                         1.500000
Burnley                          1.280000
Chelsea                          2.240000
Chelsea,Fulham                   1.000000
Crystal Palace                   1.625000
Everton                          1.607143
Everton,Southampton              3.000000
Fulham                           0.925926
Leeds                            2.608696
Leicester                        2.370370
Liverpool                        2.370370
Liverpool,Southampton            3.000000
Manchester City                  3.208333
Manchester United                2.518519
Newcastle United                 1.384615
Sheffield United                 0.666667
Southampton                      1.555556
Tottenham                        2.750000
West Bromwich Albion             1.178571
West Bromwich Albion,West Ham    0.000000
West Ham                         2.478261
Wolverhampton Wanderers          1.222222
Name: goals, dtype: float64

Korelasi Pearson antara kolom-kolom numerik


  • Method corr() menghasilkan tabel korelasi pearson antar kolom-kolom numerik.
  • Rentang nilai: antara -1 dan 1
  • -1 = korelasi negatif | 0 = tidak ada korelasi linier | +1 = korelasi positif
In [21]:
df_noid.loc[:,'games':].corr()
Out[21]:
games time goals xG assists xA shots key_passes yellow_cards red_cards npg npxG xGChain xGBuildup
games 1.000000 0.944591 0.439730 0.463869 0.504168 0.562806 0.599164 0.617867 0.565963 0.160326 0.437110 0.465546 0.726598 0.697196
time 0.944591 1.000000 0.398930 0.411203 0.473555 0.516638 0.529534 0.575065 0.592223 0.186333 0.392631 0.408231 0.703801 0.731377
goals 0.439730 0.398930 1.000000 0.932798 0.617490 0.607330 0.873363 0.567752 0.097151 0.053679 0.971591 0.905710 0.727953 0.290990
xG 0.463869 0.411203 0.932798 1.000000 0.636205 0.627495 0.910214 0.570488 0.093761 0.048815 0.894286 0.979218 0.763909 0.282746
assists 0.504168 0.473555 0.617490 0.636205 1.000000 0.885850 0.721220 0.835299 0.209349 -0.021444 0.587316 0.615503 0.752587 0.473254
xA 0.562806 0.516638 0.607330 0.627495 0.885850 1.000000 0.759568 0.946506 0.243912 0.006284 0.585152 0.611100 0.814487 0.547983
shots 0.599164 0.529534 0.873363 0.910214 0.721220 0.759568 1.000000 0.743370 0.249957 0.073932 0.852989 0.901386 0.843152 0.448197
key_passes 0.617867 0.575065 0.567752 0.570488 0.835299 0.946506 0.743370 1.000000 0.343357 0.022780 0.539726 0.545537 0.807958 0.618754
yellow_cards 0.565963 0.592223 0.097151 0.093761 0.209349 0.243912 0.249957 0.343357 1.000000 0.165064 0.093270 0.089065 0.401884 0.562467
red_cards 0.160326 0.186333 0.053679 0.048815 -0.021444 0.006284 0.073932 0.022780 0.165064 1.000000 0.055542 0.047354 0.104005 0.167660
npg 0.437110 0.392631 0.971591 0.894286 0.587316 0.585152 0.852989 0.539726 0.093270 0.055542 1.000000 0.913496 0.720978 0.284135
npxG 0.465546 0.408231 0.905710 0.979218 0.615503 0.611100 0.901386 0.545537 0.089065 0.047354 0.913496 1.000000 0.763481 0.273090
xGChain 0.726598 0.703801 0.727953 0.763909 0.752587 0.814487 0.843152 0.807958 0.401884 0.104005 0.720978 0.763481 1.000000 0.802073
xGBuildup 0.697196 0.731377 0.290990 0.282746 0.473254 0.547983 0.448197 0.618754 0.562467 0.167660 0.284135 0.273090 0.802073 1.000000